perf(geotiff): batch _nvcomp_batch_compress allocations and D2H readback (#1712) #1729
Merged
brendancol merged 2 commits on May 12, 2026
Conversation
Contributor
Pull request overview
This PR optimizes the GeoTIFF GPU write path’s nvCOMP batch compression by removing per-tile device allocations and per-tile device→host readbacks, bringing the encode side in line with the already-batched decode patterns.
Changes:
- Replace per-tile `cupy.empty` compressed-output allocations with a single contiguous device pool plus per-tile slab views/pointers (see the sketch after this list).
- Replace per-tile `.get().tobytes()` readback with a packed `cupy.concatenate(...)`, a single `.get()`, and host slicing.
- Add a new test module intended to prevent regressions back to the per-tile allocation/readback patterns and to round-trip deflate/zstd writes.
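A minimal sketch of the pooled-allocation side, assuming cupy; `n_tiles`, `max_cs`, and the pointer-array construction are stand-ins for the real per-batch values inside `_nvcomp_batch_compress`:

```python
import cupy

n_tiles, max_cs = 1024, 8192  # example batch: tile count, worst-case compressed size

# One memory-pool call instead of n_tiles calls.
pool = cupy.empty(n_tiles * max_cs, dtype=cupy.uint8)

# Per-tile slab views share the pool's memory; no copies are made.
slabs = [pool[i * max_cs:(i + 1) * max_cs] for i in range(n_tiles)]

# Device-pointer array so nvCOMP can still write each slab independently.
d_ptrs = cupy.asarray([s.data.ptr for s in slabs], dtype=cupy.uint64)
```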
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `xrspatial/geotiff/_gpu_decode.py` | Implements the contiguous compressed-output pool and batched D2H readback in `_nvcomp_batch_compress`. |
| `xrspatial/geotiff/tests/test_nvcomp_batch_compress_batched_1712.py` | Adds structural/source assertions and end-to-end round-trip tests for the updated batching behavior. |
Comments suppressed due to low confidence (2)
xrspatial/geotiff/tests/test_nvcomp_batch_compress_batched_1712.py:106
- These round-trip tests don’t currently guarantee that the nvCOMP GPU compress path actually ran: `write_geotiff_gpu` falls back to CPU codecs when `_nvcomp_batch_compress` returns `None` and does not necessarily raise `RuntimeError`, so the test can pass while never exercising the changed code. To ensure coverage, skip up-front unless nvCOMP is available/usable (e.g., `_gpu_decode._get_nvcomp() is not None`), and/or assert that the writer took the nvCOMP branch via a spy/monkeypatch/strict-mode signal.
```python
with tempfile.TemporaryDirectory(prefix="nvcomp_batch_1712_") as td:
    path = os.path.join(td, f"roundtrip_{compression}.tif")
    try:
        write_geotiff_gpu(
            darr, path,
            compression=compression,
            tiled=True,
            tile_size=64,
        )
    except RuntimeError as e:
        # nvCOMP may be unavailable in this environment; the writer
        # falls back to CPU and that path doesn't exercise the
        # change. Skip rather than fail.
        pytest.skip(f"nvCOMP unavailable for {compression}: {e}")
```
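One way to implement the up-front skip the reviewer suggests, assuming `_gpu_decode._get_nvcomp()` returns `None` when nvCOMP is unusable (the helper name comes from the comment above):

```python
import pytest
from xrspatial.geotiff import _gpu_decode

# Skip at collection time instead of swallowing a RuntimeError later, so a
# silent CPU fallback can never make the round-trip test pass vacuously.
requires_nvcomp = pytest.mark.skipif(
    _gpu_decode._get_nvcomp() is None,
    reason="nvCOMP unavailable; the batched GPU compress path would not run",
)
```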
xrspatial/geotiff/tests/test_nvcomp_batch_compress_batched_1712.py:124
- The docstring says this is a “0-tile compress” edge case, but the test exercises a 32x32 single-tile write/read instead and even notes that `n_tiles == 0` never occurs via the writer. Either adjust the docstring to match what’s being tested (tiny single-tile round-trip) or add a direct unit-level test that actually hits the `n_tiles == 0` branch in `_nvcomp_batch_compress` (e.g., via a stub nvCOMP handle).
```python
def test_gpu_write_zero_tile_edge_case():
    """A 0-tile compress returns an empty list without indexing into None.

    The cumulative-sum / concat path must short-circuit before
    ``cupy.concatenate`` (which would raise on an empty list). The
    pre-fix loop simply iterated zero times, so the contract is the
    same empty-list output.
    """
    # Direct call into the internal function with n_tiles=0 is
    # awkward because it needs a libnvCOMP handle and matching opts.
    # Instead, exercise the public writer with a tiny single-tile
    # input and confirm the fast path does not crash. Real n_tiles==0
    # never occurs via the writer (every image has at least one tile).
```
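A hedged sketch of the direct unit-level alternative the reviewer mentions: reduce the readback to a helper whose `n_tiles == 0` contract can be asserted without a libnvCOMP handle. The helper and its signature are hypothetical, not the real `_nvcomp_batch_compress`:

```python
import cupy

def _collect_slabs(d_comp_bufs, comp_sizes):
    # Hypothetical extraction of the readback step: must short-circuit
    # before cupy.concatenate, which raises ValueError on an empty list.
    if not d_comp_bufs:
        return []  # same empty-list contract as the pre-fix zero-iteration loop
    trimmed = [b[:int(s)] for b, s in zip(d_comp_bufs, comp_sizes)]
    return bytes(cupy.concatenate(trimmed).get())

def test_collect_slabs_zero_tiles():
    assert _collect_slabs([], []) == []
```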
Comment on lines +2555 to +2559
```python
trimmed = [
    d_comp_bufs[i][:int(comp_sizes[i])] for i in range(n_tiles)
]
d_comp_concat = cupy.concatenate(trimmed)
host_concat = bytes(d_comp_concat.get())
```
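The slicing that follows this snippet is outside the quoted range; a sketch of the cumulative-offset peel the PR summary describes (`payloads`, `offset`, and `size` are assumed names, not quoted from the diff):

```python
# Peel per-tile payloads out of the single host buffer by running offset.
payloads = []
offset = 0
for i in range(n_tiles):
    size = int(comp_sizes[i])
    payloads.append(host_concat[offset:offset + size])
    offset += size
```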
Comment on lines +24 to +34
```python
try:
    import cupy
    _HAS_CUPY = True
except Exception:
    _HAS_CUPY = False

# nvCOMP is the entry point that exercises this code path.
from xrspatial.geotiff import _gpu_decode

needs_cupy = pytest.mark.skipif(
    not _HAS_CUPY, reason="CUDA / cupy unavailable on this host"
)
```
…ack (#1712)

Two related anti-patterns in the GPU compress path remained after the decode-side fix in #1552 and the device-pointer fix in #1659:

1. Per-tile `cupy.empty` allocations for the compressed-output buffers. For an N-tile write, this issued N memory-pool calls. Replace with one contiguous allocation of size `n_tiles * max_cs` plus per-tile slab views; the per-tile pointer array still lets nvCOMP write independent slabs in parallel.

2. Per-tile `.get().tobytes()` in the result-collection loop. Each `.get()` was a separate D2H transfer on the default stream, and the per-DMA setup cost dominated wall time for large-N writes (the exact pattern #1552 fixed on the decode side). Replace with a single `cupy.concatenate` of trimmed slabs and one `.get()`, then slice the host buffer by cumulative offsets to peel out per-tile payloads. The adler32 deflate-wrap step is unchanged.

Real-world benchmark on an Ampere-class GPU: 2048x2048 float32 zstd GPU write at `tile_size=64` (1024 tiles) drops median from 84.3 ms to 54.7 ms (~35% reduction).

Tests cover the structural change (regressions back to the per-tile patterns fail loudly) plus end-to-end round-trip equality at deflate and zstd compression. `_check_gpu_memory` now bounds the new contiguous allocation in the same way the decode-side fix does.
958e8f4 to 14cf257
- Add `_check_gpu_memory` guard before nvCOMP staging-buffer concatenate (mirrors `_batched_d2h_to_bytes` pattern)
- Tighten GPU skip to require `cupy.cuda.is_available()` (see the sketch below)
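A sketch of what the tightened skip could look like, using only the public `cupy.cuda.is_available()` API; the `_check_gpu_memory` guard placement is internal to `_gpu_decode` and not shown here:

```python
import pytest

try:
    import cupy
    # Importing cupy is not enough: also require a usable CUDA device.
    _HAS_GPU = cupy.cuda.is_available()
except Exception:
    _HAS_GPU = False

needs_cupy = pytest.mark.skipif(
    not _HAS_GPU, reason="CUDA / cupy unavailable on this host"
)
```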
Summary
`_nvcomp_batch_compress` had two anti-patterns left over after the decode-side batched D2H fix in #1552 and the device-pointer fix in #1659:
- Per-tile `cupy.empty` allocations for compressed-output buffers. For an N-tile write this issued N memory-pool calls. Replace with one contiguous allocation sized `n_tiles * max_cs` plus per-tile slab views; the per-tile pointer array still lets nvCOMP write independent slabs in parallel. `_check_gpu_memory` now bounds the new allocation.
- Per-tile `.get().tobytes()` in the result-collection loop. Each `.get()` was a separate D2H transfer on the default stream, and the per-DMA setup cost dominated wall time for large-N writes (exactly the pattern #1552, "Batch GDS-fallback per-tile GPU to host transfer", fixed on the decode side). Replace with one `cupy.concatenate` of trimmed slabs and one `.get()`, then slice the host buffer by cumulative offsets. The adler32 deflate-wrap step stays on the host as before (a framing sketch follows this list).
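For reference, the host-side deflate wrap amounts to zlib (RFC 1950) framing around a raw DEFLATE body; a minimal illustrative sketch, assuming nvCOMP emits a raw DEFLATE stream (not the PR's exact helper):

```python
import zlib

def wrap_zlib(raw_deflate: bytes, uncompressed: bytes) -> bytes:
    # zlib stream = 2-byte header + raw DEFLATE body + big-endian adler32
    # checksum of the *uncompressed* payload (RFC 1950).
    header = b"\x78\x9c"
    checksum = zlib.adler32(uncompressed) & 0xFFFFFFFF
    return header + raw_deflate + checksum.to_bytes(4, "big")
```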
Real-world benchmark on an Ampere-class GPU: a 2048x2048 float32 zstd GPU write at `tile_size=64` (1024 tiles) drops the median from 84.3 ms to 54.7 ms (~35% reduction). The decode-side fix in #1552 measured a similar 256 ms -> 38 ms drop on the analogous LZW path; this PR brings the encode side into the same shape.
Closes #1712.
Test plan
- `test_no_per_tile_cupy_empty_in_compressed_pool` and `test_no_per_tile_get_in_result_loop`: source-level guards that fail loudly if either anti-pattern returns.
- `test_gpu_write_roundtrip_after_batched_compress[deflate]` and `[zstd]`: end-to-end GPU write + read produces the same float32 array, catching any off-by-one in the host-side cumulative offsets.
- `test_gpu_write_zero_tile_edge_case`: tiny 32x32 input still round-trips through the fast path.
- `test_gpu_writer_compression_modes_2026_05_11.py`, `test_gpu_writer_attrs_1563.py`, `test_gpu_writer_band_first_1580.py`, `test_gpu_writer_nan_sentinel_1599.py`, `test_to_geotiff_gpu_fallback_1674.py`, and `test_nvcomp_from_device_bufs_single_alloc_1659.py` all pass.